Overview

In this assignment, we explore GAN inversion and guided image generation using StyleGAN and Stable Diffusion. For StyleGAN, we use scribbles as the input guide and optimize the latent code by enforcing perceptual and L1 losses between the image guide and the generated image. For Stable Diffusion, we use the noisy image guide as the initial latent for the denoising process.

Part 1: Inverting the Generator

Losses


Results with different loss weights. Leftmost: original image. From left to right: perceptual loss weight = 0, 0.003, 0.01, 0.03, 0.1. Rightmost: perceptual loss only. L1 loss weight is fixed at 10.

We tested different weights for the L1 loss, the perceptual loss, and the L2 regularization on delta. The perceptual loss is computed as the L2 distance between conv_5 features. All latents are optimized for 5000 iterations using L-BFGS.
Using only the L1 loss gives blurry reconstructions, as the loss is averaged over all pixels and penalizes local errors only weakly. Using only the perceptual loss produces slightly worse details than the combination, e.g., at the top-right corner, as the convolutional features may not capture all of the fine details in the image.
Increasing the perceptual loss weight changes the results only subtly; a weight of 0.003 or 0.01 seems to give reasonable results.
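
As a concrete reference, a minimal sketch of this combined objective is given below. It assumes a torchvision VGG-19 backbone (torchvision >= 0.13) for the perceptual features; the layer index standing in for conv_5 and the omission of ImageNet normalization are illustrative assumptions, not necessarily what our implementation uses.

import torch
import torch.nn.functional as F
from torchvision import models

# Sketch: frozen VGG-19 feature extractor standing in for the "conv_5" block.
vgg = models.vgg19(weights="IMAGENET1K_V1").features.eval()
for p in vgg.parameters():
    p.requires_grad_(False)

def conv5_features(x, stop_at=29):  # stop_at=29 is an assumed layer index
    for i, layer in enumerate(vgg):
        x = layer(x)
        if i == stop_at:
            return x

def reconstruction_loss(pred, target, w_l1=10.0, w_perc=0.01):
    # L1 over pixels plus L2 distance between conv_5 features.
    l1 = F.l1_loss(pred, target)
    perc = F.mse_loss(conv5_features(pred), conv5_features(target))
    return w_l1 * l1 + w_perc * perc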


Results with different L2 regularization weights. Leftmost: original image. From left to right: L2 regularization weight = 0, 0.001, 0.01, 0.1, 1.

We also applied L2 regularization by penalizing the L2 norm of delta, averaged over the batch dimension and, when using w+ latents, the ws dimension. With strong regularization, e.g., weight = 1, delta stays very close to zero and the image is close to the one generated from the initial latents. Small weights such as 0.001 give results similar to those without regularization, while larger weights typically lead to blurrier details.
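
A sketch of how the residual delta and its penalty plug into the L-BFGS loop is shown below; generator, w_init, and target are placeholders for the pretrained generator, the initial latent(s), and the target image, and reconstruction_loss is the combined loss sketched above.

import torch

def invert(generator, w_init, target, reg_weight=0.0, max_iter=5000):
    # Optimize a residual delta on top of a fixed initial latent with L-BFGS.
    delta = torch.zeros_like(w_init, requires_grad=True)
    opt = torch.optim.LBFGS([delta], max_iter=max_iter, line_search_fn="strong_wolfe")

    def closure():
        opt.zero_grad()
        loss = reconstruction_loss(generator(w_init + delta), target)
        # L2 norm of delta over the channel dim, averaged over batch (and ws) dims.
        loss = loss + reg_weight * delta.norm(dim=-1).mean()
        loss.backward()
        return loss

    opt.step(closure)
    return w_init + delta.detach()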

GAN Choices


Results with different GANs. From left to right: original image, vanilla GAN, StyleGAN with w+.

The vanilla GAN gives much worse reconstruction results than StyleGAN, probably because its generator is less powerful and the manifold of its generated images lies further from the real image manifold than StyleGAN's.

Latent Spaces



Results with different latent spaces. From left to right: original image, StyleGAN with z, w, w+ latent space. Top row: random initialization. Bottom row: initialization with mean.

When using the z latent space for StyleGAN, the optimization is significantly harder, as gradients need to be backpropagated through the additional mapping network, and the reconstructed image stays very close to the image generated from the initial latent rather than the target image. Using w or w+ improves the results significantly, with w+ giving better results in the details and the background, as w+ is more expressive and flexible to optimize.
The mean initialization helps with the z-space latents, probably because the expected distance between the mean and the optimal latent is smaller. For w and w+, the mean initialization gives results similar to random initialization, since the optimization is already relatively easy.
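
For reference, the mean initialization for the w/w+ spaces can be estimated by averaging the mapping-network outputs over many random z samples, roughly as sketched below (mapping_network and num_ws are placeholders for the pretrained mapping network and the number of style layers).

import torch

@torch.no_grad()
def mean_w_init(mapping_network, num_ws=None, n_samples=10000, z_dim=512):
    # Average the mapping network output over random z samples to get the mean w.
    z = torch.randn(n_samples, z_dim)
    w_mean = mapping_network(z).mean(dim=0, keepdim=True)     # (1, z_dim)
    if num_ws is not None:                                    # broadcast to w+
        w_mean = w_mean.unsqueeze(1).repeat(1, num_ws, 1)     # (1, num_ws, z_dim)
    return w_mean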

Overall, for the following experiments we use StyleGAN with w+ latents, a perceptual loss weight of 0.01 on the conv_5 layer, an L1 loss weight of 10, and no L2 regularization.
When optimizing for 5000 iterations for one image, vanilla GAN takes 25.21s, StyleGAN with z takes 56.99s, StyleGAN with w takes 81.84s, and StyleGAN with w+ takes 81.25s.

Part 2: Scribble to Image



Results using provided scribbles. Top row: scribbles. Bottom row: generated images.



Results with custom scribbles. Top row: scribbles. Bottom row: generated images. Pairs of sparse and dense scribbles are shown.

From the results, denser scribbles with filled-in colors give better generated images, especially in the facial details. Having only sparse scribbles around the silhouette of the cat usually results in darker and less detailed face regions. The reason might be that the latents controlling these unmasked regions are not explicitly supervised by the guiding scribbles but are only implicitly affected while the masked regions are being optimized; the latents may then overfit to the masked losses and produce less realistic results in the unmasked areas.
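
A sketch of the masked objective used for the scribbles is given below, reusing the conv5_features helper from Part 1; masking the inputs before the perceptual network is a simplification of how the mask could be handled.

import torch
import torch.nn.functional as F

def masked_scribble_loss(pred, scribble, mask, w_l1=10.0, w_perc=0.01):
    # mask: (B, 1, H, W), 1 where the scribble is drawn, 0 elsewhere.
    l1 = ((pred - scribble).abs() * mask).sum() / (mask.sum() * pred.shape[1] + 1e-8)
    perc = F.mse_loss(conv5_features(pred * mask), conv5_features(scribble * mask))
    return w_l1 * l1 + w_perc * perc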
We also notice that using less natural colors for the scribbles, e.g., glowing blue for the eyes, can lead to severe overfitting and unrealistic results when optimizing for a large number of iterations, whereas natural colors that conform to the real image distribution typically do not show this issue. This might be because the optimized latents for images with unnatural colors gradually drift away from the real (training) image manifold, where the generator can no longer produce realistic images.

Part 3: Stable Diffusion



Guided generation with stable diffusion and text prompt "a cat". From left to right: image guide, N=500, N=600 (700 for bottom row).

We tested the guided generation by adding noise corresponding to different time steps to the guiding image. Starting from larger time steps, e.g., N=600 or 700, leads to more flexible generation results that follow the guiding image less closely. Starting from smaller time steps such as N=500 generates images that are more similar to the guiding image.
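
The forward-noising step that produces the starting latent at step N follows the standard DDPM forward process; a sketch is given below, where alphas_cumprod is the scheduler's cumulative product of (1 - beta_t) as a tensor, and latents_guide is the (VAE-encoded) image guide.

import torch

def noise_to_step(latents_guide, N, alphas_cumprod):
    # x_N = sqrt(a_bar_N) * x_0 + sqrt(1 - a_bar_N) * noise
    noise = torch.randn_like(latents_guide)
    a_bar = alphas_cumprod[N]
    return a_bar.sqrt() * latents_guide + (1.0 - a_bar).sqrt() * noise

# The reverse (denoising) process then runs from t = N down to 0 instead of from T.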



Guided generation with different classifier-free guidance strength. From left to right: image guide, strength=5, 15, 50. Text prompt: "a cat".

When varying the strength of the classifier-free guidance, we found that a larger strength leads to results that are more stylized and less similar to the guiding image. In the examples above, smaller guidance strength results in images that are more similar to the guiding sketch, with flat color strokes, whereas larger strength results in images closer to the text prompt, with more detailed brush strokes and textures.
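
The guidance strength enters each denoising step through the standard classifier-free guidance combination of the conditional and unconditional noise predictions, sketched below; eps_uncond and eps_cond are the UNet predictions for the empty and the actual text prompt.

def cfg_noise(eps_uncond, eps_cond, strength):
    # strength = 1 recovers the purely conditional prediction;
    # larger values push the result further toward the text prompt.
    return eps_uncond + strength * (eps_cond - eps_uncond)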

Bells & Whistles

DDIM Sampler



Results using DDIM sampler. From left to right: image guide, DDPM results, DDIM with S=5, 10, 25, 50 steps. N=500.

We implemented the DDIM sampler and tested the guided generation with different numbers of denoising steps S. We use linear subsequences of the time steps containing 1/10, 1/20, 1/50, and 1/100 of the total steps, starting from N. In the first example, fewer denoising steps result in images that are more similar to the guiding image, while S=25 and 50 give very similar results. Compared to DDPM, the DDIM results seem to match the guiding image better in shape but less in color.
In the second example, DDIM with different numbers of steps gives similar results, which match the guiding image more closely than DDPM in terms of the patch patterns and colors. In general, DDIM generates images of comparable quality to DDPM but much faster.
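
For reference, the deterministic DDIM update (eta = 0) applied at each step of the subsequence is sketched below, with alphas_cumprod as above.

import torch

def ddim_step(x_t, eps, t, t_prev, alphas_cumprod):
    # One deterministic DDIM step (eta = 0) from time t to t_prev.
    a_t = alphas_cumprod[t]
    a_prev = alphas_cumprod[t_prev] if t_prev >= 0 else torch.tensor(1.0)
    x0_pred = (x_t - (1.0 - a_t).sqrt() * eps) / a_t.sqrt()       # predicted x_0
    return a_prev.sqrt() * x0_pred + (1.0 - a_prev).sqrt() * eps  # step toward t_prev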

GAN Latent Interpolation


Images from interpolated StyleGAN latent code.

We tried linearly interpolating between the StyleGAN latent codes inverted from two different images; the results are shown above. We use the w+ space and 1000 optimization steps for the inversion.
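
A sketch of the interpolation, where w_a and w_b are the two inverted w+ codes and generator is a placeholder for the pretrained StyleGAN synthesis network:

import torch

def interpolate_latents(generator, w_a, w_b, n_frames=30):
    # Decode images along the straight line between the two inverted latents.
    frames = []
    for alpha in torch.linspace(0.0, 1.0, steps=n_frames):
        frames.append(generator((1.0 - alpha) * w_a + alpha * w_b))
    return frames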

Acknowledgements

The website template was borrowed from Michaël Gharbi and Ref-NeRF.